cuda.core: convert peer_accessible_by to a live MutableSet view#2018
cuda.core: convert peer_accessible_by to a live MutableSet view#2018Andy-Jost wants to merge 11 commits intoNVIDIA:mainfrom
Conversation
DeviceMemoryResource.peer_accessible_by previously returned a sorted
tuple[int, ...] backed by a Python-level cache, which was prone to
divergence from driver state across multiple wrappers around the same
memory pool. The setter accepted Device | int and emitted a single
batched cuMemPoolSetAccess covering the diff against the cache.
This commit replaces the property with a live driver-backed view:
- Adds PeerAccessibleBySetProxy in _memory/_peer_access_utils.py, a
collections.abc.MutableSet whose reads call cuMemPoolGetAccess and
whose writes call cuMemPoolSetAccess. Iteration yields Device
objects; add, discard, and __contains__ accept either a Device or a
device-ordinal int. The proxy is constructed fresh on every property
access, so there is nothing to cache or pickle.
- Drops the _peer_accessible_by cache field (and its initializations
in __cinit__, _DMR_init, and from_allocation_handle), eliminating
the owned/non-owned read split. All pools now share the same code
path and always query the driver.
- All bulk operations on the proxy (update, |=, &=, -=, ^=, clear,
pop) issue exactly one cuMemPoolSetAccess call. Peer-access
transitions can take seconds per pool because every existing memory
mapping is updated, so coalescing into a single driver call lets the
toolkit handle the mappings in parallel. The property setter
(mr.peer_accessible_by = [...]) preserves its original single-call
behavior via the same shared planner path.
- Single-element add validates can_access_peer through
plan_peer_access_update, matching the existing setter contract.
This is a breaking change captured in the v1.0.0 release notes.
Callers comparing against tuples must update to set comparisons
(mr.peer_accessible_by == {Device(0)}). Existing tests are migrated;
new tests for set-interface conformance are intentionally deferred to
a follow-up.
Co-authored-by: Cursor <cursoragent@cursor.com>
|
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
|
FYI: A breaking change must be labeled with P0. |
…s.pyx The previous commit left DeviceMemoryResource carrying three pass-through def methods (_query_peer_access_ids, _peer_access_includes, _apply_peer_access_diff) whose only purpose was to give the pure-Python proxy in _peer_access_utils.py a way to call cdef helpers in _device_memory_resource.pyx. These methods served no public role and cluttered the class API. Promote _peer_access_utils.py to a Cython module so the proxy and the driver-touching helpers can live together: - Convert _peer_access_utils.py to _peer_access_utils.pyx. cimports cydriver and DeviceMemoryResource from the .pxd; uses nogil and direct CUmemAccessDesc packing identically to before. - Move _DMR_query_peer_access_ids, _DMR_peer_access_includes, _DMR_apply_peer_access_diff, and _DMR_replace_peer_accessible_by from _device_memory_resource.pyx into the new module as cdef helpers (and a cpdef replace_peer_accessible_by entry point used by the property setter). - Drop the three pass-through def methods from DeviceMemoryResource. The class is left with the property getter and setter only; everything else is module-level in _peer_access_utils. - The proxy now calls the module-level cdef helpers directly instead of routing through methods on mr. No behavior change. The public surface (PeerAccessibleBySetProxy, plan_peer_access_update, normalize_peer_access_targets, PeerAccessPlan) is preserved at the same import paths. Co-authored-by: Cursor <cursoragent@cursor.com>
Refactor _query_peer_access_ids so the entire driver loop runs inside a single nogil block instead of acquiring/releasing the GIL once per device. The flag query now uses a cached as_cu(mr._h_pool) handle and fills a libcpp.vector[int]; because range(total) ascends, the result is already sorted and the trailing sorted() call is dropped. Also tighten the peer_accessible_by entry in 1.0.0-notes.rst: the breaking-change blurb only needs to state the type/element change, so remove the implementation-flavored details about input acceptance and batched cuMemPoolSetAccess calls. Co-authored-by: Cursor <cursoragent@cursor.com>
…ge cases Existing peer-access tests covered the integration path well (real copies across peers, the full transition matrix, shared-pool consistency) but only touched ``in``, ``==``, and the property setter on the new set proxy. After the v1.0.0 break that surfaced ~25 ``MutableSet`` methods, nothing was pinning the type-coercion contract, the owner-filtering behavior, the ``KeyError``/value-error paths, or the "one ``cuMemPoolSetAccess`` per bulk op" performance invariant. Add the following coverage in ``test_memory_peer_access.py``: - A ``MutableSet`` conformance test using a relaxed ``assert_mutable_set_interface`` mode that admits subjects holding at most one insertable element. CI maxes at two GPUs (one peer), so the multi-element protocol pass cannot run there. The new ``support_multi_insert=False`` path takes one insertable item plus two non-member sentinels and exercises every ``MutableSet`` method (``add``/``discard``/``remove``/``pop``/``clear``/``update``, comparisons, isdisjoint, subset/superset, binary and in-place operators, ``__iter__``/``__len__``/``__repr__``). - ``Device``/``int`` interchangeability on ``add``/``discard``/``__contains__``. - The owner-device filtering contract on every write (silent no-op). - Error paths: ``add(out_of_range)`` and ``add(non_coercible)`` raise while the lenient ``discard``/``__contains__`` paths swallow the same inputs; ``remove(non_member)`` raises ``KeyError``. - "Live driver view" semantics: a proxy obtained before another wrapper modifies the pool reflects the change with no refresh step. - ``__iter__`` ordering is ascending by ``device_id`` and elements are ``Device`` instances; ``__repr__`` includes the class name and tracks live contents; the getter returns the documented proxy type. - A batching spy that monkeypatches the module-level ``_apply_peer_access_diff`` and asserts that every bulk op (``|=``/``&=``/``-=``/``^=``/``update``/``difference_update``/ ``clear``) and the property setter issues at most one driver call, zero when the diff is empty. To make the spy possible, ``_apply_peer_access_diff`` is now a Python-visible ``def`` wrapper around a renamed ``_apply_peer_access_diff_cython`` ``cdef inline``. The proxy and the property setter still call ``_apply_peer_access_diff`` by bare name, which Cython resolves through the module's globals at runtime, so a ``monkeypatch.setattr(_peer_access_utils, "_apply_peer_access_diff", ...)`` intercepts them. The extra Python-level dispatch is negligible next to ``cuMemPoolSetAccess`` itself. Co-authored-by: Cursor <cursoragent@cursor.com>
Augmented assignment on the ``peer_accessible_by`` property
(``dmr.peer_accessible_by |= {...}``) is two trips through the
proxy/setter pair, not one: Python fetches the proxy, the proxy mutates
itself in place via ``__ior__``, and Python then assigns the
(already-mutated) proxy back through the setter. That trailing setter
call computes the diff against current driver state, finds it empty,
and short-circuits inside the ``cdef inline`` before issuing any
``cuMemPoolSetAccess`` work — so the *driver-level* contract ("one
batched call per bulk op") still holds, but the wrapper is invoked
twice, which the spy was counting.
Also, the fixture's ``dmr.peer_accessible_by = []`` reset on an already
empty pool is itself an empty-delta wrapper call.
Filter the recorded calls down to those with non-empty deltas (the ones
that translate to real driver work) and switch the bulk-ops test to use
a locally bound proxy so augmented assignment goes through ``__ior__``
once with no extra setter invocation. The setter test stays on
``dmr.peer_accessible_by = ...`` because that is the public API
contract under test there.
Co-authored-by: Cursor <cursoragent@cursor.com>
…hing
Move the actual ``cuMemPoolSetAccess`` invocation (descriptor-array
build + driver call) into a thin Python-visible ``def _set_pool_access``
in ``_peer_access_utils.pyx``. ``_apply_peer_access_diff`` now does only
the empty-diff short-circuit and delegates the work to
``_set_pool_access``, which Cython resolves through the module globals
at runtime so tests can intercept it via ``monkeypatch.setattr``.
Replace the previous internal-wrapper spy with a driver-call spy that
counts every real ``cuMemPoolSetAccess`` invocation. Earlier no-op
layers (e.g. the augmented-assignment-on-property pattern that writes
an already-mutated proxy back through the setter) short-circuit before
reaching ``_set_pool_access``, so the recorded count is exactly the
number of driver calls. The empty-delta filter and the local-binding
workaround in the bulk-ops test are gone; we now also assert that
``dmr.peer_accessible_by |= {...}`` directly on the property is still
exactly one driver call.
Co-authored-by: Cursor <cursoragent@cursor.com>
Replace the ``support_multi_insert`` flag and ``non_members`` keyword with two purpose-built helpers: - ``assert_mutable_set_interface(subject, items)`` keeps the original signature and contract: at least five distinct insertable items, exercised against a reference set in the standard way. The graph ``AdjacencySetProxy`` test continues to use this unchanged. - ``assert_single_member_mutable_set_interface(subject, member, non_member)`` is a focused pass for proxies whose backing store admits at most one insertable element at a time (here, the peer-access view on a 2-GPU box). It threads a single member and one non-member sentinel through every ``MutableSet`` method. The two helpers share small private utilities (empty-state checks, ``__repr__`` shape) but keep their public surfaces small and linear. A capacity-one proxy is a meaningfully different contract from a general mutable set; naming that explicitly in the API reads better than a flag and avoids forcing call sites to plumb sentinels through. Co-authored-by: Cursor <cursoragent@cursor.com>
|
/ok to test |
| def assert_single_member_mutable_set_interface(subject, member, non_member): | ||
| """Exercise every MutableSet method on a subject with capacity one. | ||
|
|
||
| Use this for proxies whose backing store admits at most one insertable | ||
| element at a time (typically because the underlying resource is bounded | ||
| by hardware, e.g. a peer-access view on a system with a single valid | ||
| peer device). The subject only ever holds ``set()`` or ``{member}``; | ||
| *non_member* supplies the right-hand side of comparisons, ``isdisjoint``, | ||
| subset/superset, and binary/in-place operators so every ``MutableSet`` | ||
| method is exercised at least once. |
|
|
Q: It is a big diff with a very niche use case. Are we sure this is P0? |
The peer access list is currently a tuple. This change aligns with the set proxies in the graph module ( This change is not functionally as big as it might appear. (1) Most of the bulk is just code movement: I moved the peer access implementation from |
…-by-set-proxy Co-authored-by: Cursor <cursoragent@cursor.com> # Conflicts: # cuda_core/docs/source/api_private.rst
…-by-set-proxy Co-authored-by: Cursor <cursoragent@cursor.com> # Conflicts: # cuda_core/docs/source/api_private.rst # cuda_core/docs/source/release/1.0.0-notes.rst
…-by-set-proxy Co-authored-by: Cursor <cursoragent@cursor.com> # Conflicts: # cuda_core/cuda/core/_memory/_device_memory_resource.pyx
Summary
DeviceMemoryResource.peer_accessible_bypreviously returned a sortedtuple[int, ...]backed by a Python-level cache. This is fragile in two ways: the cache can diverge from driver state across multiple wrappers around the same memory pool (#1720), and tuple semantics force callers into bespoke splice/replace patterns instead of standard set operations. This PR replaces the property with a live driver-backedcollections.abc.MutableSetview, matching the proxy patterns established byAdjacencySetProxy(graphs) and the in-flightAccessedBySet(managed-memory advice, #1775).This is a breaking change, captured in the v1.0.0 release notes.
Changes
New:
PeerAccessibleBySetProxyA
MutableSetsubclass living incuda_core/cuda/core/_memory/_peer_access_utils.pyx.__contains__,__iter__,__len__) callcuMemPoolGetAccess.add,discard, and bulk ops) callcuMemPoolSetAccess.Deviceobjects.add,discard,__contains__acceptDevice | int.update,|=,&=,-=,^=,clear) issue exactly onecuMemPoolSetAccesscall. This matters: peer-access transitions can take seconds per pool because every existing memory mapping is updated, so coalescing into a single driver call lets the toolkit handle the mappings in parallel.The proxy is constructed fresh on every property access. There is nothing to cache or pickle.
Cache removal
The
_peer_accessible_byfield onDeviceMemoryResourceis dropped, along with its initializations in__cinit__,_DMR_init, andfrom_allocation_handle. The owned/non-owned read split is gone — every read path now queries the driver directly. This eliminates the bug class fixed in #1720 and simplifies the implementation.Module consolidation
All peer-access internals now live in
_peer_access_utils.pyx(promoted from a.py). Thecdef inlinedriver helpers (_query_peer_access_ids,_peer_access_includes,_set_pool_access) and thecpdef replace_peer_accessible_bysetter helper moved out of_device_memory_resource.pyx.DeviceMemoryResourcenow exposes only the publicpeer_accessible_byproperty: the getter returnsPeerAccessibleBySetProxy(self)and the setter delegates toreplace_peer_accessible_by(self, devices). This keeps the DMR surface focused on its memory-management role.Property setter
mr.peer_accessible_by = [...]is preserved and unchanged in behavior. It still does a single batchedcuMemPoolSetAccessvia the same sharedplan_peer_access_updatepath the proxy uses for bulk ops._query_peer_access_idsGIL optimizationThe driver loop that enumerates peer device IDs now runs inside a single
nogilblock instead of acquiring/releasing the GIL once per device. Per-call data is collected into alibcpp.vector[int], and becauserange(total)ascends the result is already sorted (so the trailingsorted()is gone). This is a local performance tweak with no behavior change.Test coverage
The existing integration tests (
test_memory_peer_access.py,memory_ipc/test_peer_access.py) migrated from tuple equality to set-of-Deviceequality. Eight new tests pin the proxy's contract end-to-end, all using the existingmempool_device_x2fixture so they run on CI:assert_single_member_mutable_set_interface(subject, member, non_member)helper exercises everyMutableSetmethod against subjects whose backing store admits at most one insertable element (the peer-access proxy can hold at most{dev1}on a 2-GPU box). The originalassert_mutable_set_interface(subject, items)is unchanged for normal multi-element subjects (AdjacencySetProxy).Device/intinterchangeability onadd/discard/__contains__.add(out_of_range)andadd(non_coercible)raise;discard/__contains__swallow the same inputs;remove(non_member)raisesKeyError.device_id; elements areDeviceinstances;__repr__shape; getter return type.monkeypatchspy on_set_pool_access(the thin Python-visible helper extracted from_apply_peer_access_diffpurely so tests can spy on the actual driver call). Every bulk op (|=/&=/-=/^=/update/difference_update/clear) and the property setter is asserted to issue exactly onecuMemPoolSetAccess, zero on no-op. This includes thedmr.peer_accessible_by |= {...}augmented-assignment-on-property pattern, where the empty-diff short-circuit prevents a second driver call from the trailing setter write.Breaking change
DeviceMemoryResource.peer_accessible_byno longer returns atuple[int, ...]. Callers must update:mr.peer_accessible_by == (0,)) -> set comparisons (mr.peer_accessible_by == {Device(0)}).Deviceobjects (use[d.device_id for d in mr.peer_accessible_by]if int IDs are needed).Setter usage (
mr.peer_accessible_by = [...]) is unchanged.Test plan
test_memory_peer_access.pytests pass after migration to set semantics.test_memory_peer_access_utils.pytests still pass (plan_peer_access_updateandnormalize_peer_access_targetsare unchanged).memory_ipc/test_peer_access.pytests pass (IPC-enabled pools, parent-child IPC scenarios).cuda_coretest suite passes on a single-GPU box (2931 passed, 212 skipped — peer-access tests gated onmempool_device_x2/x3).pre-commit run --all-filespasses (ruff, format, SPDX, cython-lint, RST checks).Related work
AccessedBySet, the second driver-backed set proxy incuda.core.cuda_core/cuda/core/graph/_graph_node.pyx(AdjacencySetProxy) — reference pattern for driver-backed set proxies. SameMutableSetshape, same_from_iterableplain-set fallback, same single-driver-call bulk-op contract.